Support abort background compaction jobs. #14227
xingbowang wants to merge 8 commits into facebook:main from
Summary:
This adds a new public API to allow applications to abort all running compactions and prevent new ones from starting. Unlike DisableManualCompaction(), which only pauses manual compactions and waits for them to finish naturally, AbortAllCompactions() actively signals running compactions (both automatic and manual) to terminate early and waits for them to complete before returning.
The abort signal is checked periodically during compaction (every 100 keys), so ongoing compactions abort quickly. Any output files from aborted compactions are automatically cleaned up to prevent partial results from being installed.
This is useful for scenarios where applications need to quickly stop all compaction activity, such as during graceful shutdown or when performing maintenance operations.
This also adds a new public API to resume compactions after the call to abort.
Limitation: compaction service is not supported.
Test Plan:
- Unit tests in db_compaction_abort_test.cc cover various abort scenarios, including: abort before/during compaction, abort with multiple subcompactions, nested abort/resume calls, abort with the CompactFiles API, abort across multiple column families, and timing guarantees
- Updated compaction_job_test.cc to include the new parameter
- Stress test
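To make the described semantics concrete, here is a minimal single-threaded sketch of a periodic abort check with nested abort/resume calls. It does not use RocksDB: `AbortController`, `RunToyCompaction`, and `kCheckEvery` are hypothetical names, and the counter-based nesting is an assumption about how nested calls might balance, not a description of the PR's implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for DB-wide abort state; not a RocksDB class.
class AbortController {
 public:
  // Each Abort() increments a counter. Compactions stay blocked until
  // every Abort() has been matched by a Resume() (assumed semantics).
  void Abort() { pending_aborts_.fetch_add(1, std::memory_order_release); }

  void Resume() {
    int prev = pending_aborts_.fetch_sub(1, std::memory_order_acq_rel);
    assert(prev > 0 && "Resume() without a matching Abort()");
    (void)prev;
  }

  bool IsAborted() const {
    return pending_aborts_.load(std::memory_order_acquire) > 0;
  }

 private:
  std::atomic<int> pending_aborts_{0};
};

// The PR summary says the signal is checked every 100 keys.
constexpr uint64_t kCheckEvery = 100;

// Processes up to num_keys keys and returns how many were processed.
// abort_at_key simulates another thread calling Abort() while that key
// is being processed; pass UINT64_MAX to never trigger it.
uint64_t RunToyCompaction(AbortController& ctl, uint64_t num_keys,
                          uint64_t abort_at_key) {
  uint64_t processed = 0;
  for (uint64_t i = 0; i < num_keys; ++i) {
    if (i == abort_at_key) {
      ctl.Abort();
    }
    ++processed;  // stands in for the real per-key compaction work
    // Only pay for the atomic load once every kCheckEvery keys.
    if (processed % kCheckEvery == 0 && ctl.IsAborted()) {
      break;
    }
  }
  return processed;
}
```

In this model an abort raised at key 250 is noticed at the next check boundary (key 300), which illustrates why the check interval bounds how quickly a running compaction can stop.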
@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D91480994.
  }
  for (const std::string& file_path :
       sub_compact.Outputs(is_proximal_level)->GetOutputFilePaths()) {
    Status s = env_->DeleteFile(file_path);
Wouldn't a compaction with non-ok status get automatically cleaned up? Why do we need to explicitly do the cleanup here?
The normal cleanup path (Cleanup() in compaction_outputs.h lines 247-253) only abandons in-progress builders. This does not delete already-finished output files that were successfully written to disk.
When compaction runs with multiple subcompactions in parallel:
- Subcompaction A completes successfully → produces finished SST/blob files on disk.
- Subcompaction B gets aborted (or the overall compaction is paused).
- The overall compaction status becomes CompactionAborted or ManualCompactionPaused.
At this point, Subcompaction A's output files are fully written and finished on disk, but the overall compaction is aborted, so these files will never be installed into the LSM tree. Without explicit cleanup, these files become orphans on disk.
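The scenario above can be sketched outside RocksDB with `std::filesystem`. This is a toy model only: `ToySubcompaction`, `WriteToyFile`, and `CleanupOutputsIfAborted` are hypothetical names, and the real code deletes through `env_->DeleteFile` as shown in the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical model of one subcompaction and the files it finished.
struct ToySubcompaction {
  bool finished_ok;                    // true if this subcompaction completed
  std::vector<fs::path> output_files;  // fully written outputs on disk
};

// Helper for the example: creates a small file standing in for an SST.
void WriteToyFile(const fs::path& path) {
  std::ofstream out(path);
  out << "toy sst contents";
}

// If the overall compaction was aborted, delete every finished output,
// including those of subcompactions that succeeded on their own, since
// none of them will be installed. Returns the number of files deleted.
std::size_t CleanupOutputsIfAborted(
    bool compaction_aborted, const std::vector<ToySubcompaction>& subs) {
  if (!compaction_aborted) {
    return 0;
  }
  std::size_t deleted = 0;
  for (const ToySubcompaction& sub : subs) {
    for (const fs::path& path : sub.output_files) {
      std::error_code ec;
      // remove() returns false if the file was missing or on error.
      if (fs::remove(path, ec)) {
        ++deleted;
      }
    }
  }
  return deleted;
}
```

The point of the sketch is that the deletion is keyed on the overall compaction status rather than on each subcompaction's own status, which is what catches Subcompaction A's finished files.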
Interesting. Thanks for the clarification. I think you're right. Is this a problem with subcompactions in general, then? For example, if subcompaction B fails due to an IO error, there is no cleanup of subcompaction A's files. I'm not saying that needs to be addressed in this PR, since it's a separate issue, but is it something to be tracked?
I did some more investigation around this. There is another function, FindObsoleteFiles, that scans directories to find files not in any live version and cleans them up on compaction or flush failure. We could rely on that for rare failures such as IO errors. For the abort operation, we could switch to that as well. However, it would break resumable compaction, as FindObsoleteFiles does not know whether a compaction is resumable or not.
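The trade-off in the reply above can be illustrated with a toy scan that flags every on-disk file missing from the live set. This is a simplified sketch, not the logic of RocksDB's FindObsoleteFiles; `FindToyObsoleteFiles` and the file names are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Returns every file on disk that no live version references.
// The scan has no notion of "belongs to a resumable compaction", so
// partial outputs that a later resume would want are flagged as well.
std::vector<std::string> FindToyObsoleteFiles(
    const std::vector<std::string>& files_on_disk,
    const std::set<std::string>& live_files) {
  std::vector<std::string> obsolete;
  for (const std::string& name : files_on_disk) {
    if (live_files.count(name) == 0) {
      obsolete.push_back(name);
    }
  }
  return obsolete;
}
```

For example, with `000001.sst` and `000002.sst` live and `000010.sst` written by an interrupted compaction, the scan reports `000010.sst` regardless of whether that compaction was meant to be resumed.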
db/compaction/compaction_job.cc
Outdated
const uint64_t num_records = c_iter->iter_stats().num_input_records;

// Periodic cron operations: stats update, abort check, and sync points
if (num_records % kCronEvery == kCronEvery - 1) {
Nit: Can we avoid the % (or make kCronEvery a power of 2)?
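As a sketch of what the nit suggests: if the interval is a power of two, the remainder test can be written as a bit mask. The value 128 and the names `kToyCronEvery` and `IsCronTick` are assumptions for illustration; the constant's value in the PR is not shown in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical interval; must be a power of two for the mask to work.
constexpr uint64_t kToyCronEvery = 128;
static_assert((kToyCronEvery & (kToyCronEvery - 1)) == 0,
              "kToyCronEvery must be a power of two");

// Equivalent to: num_records % kToyCronEvery == kToyCronEvery - 1,
// but uses a single AND instead of a division.
constexpr bool IsCronTick(uint64_t num_records) {
  return (num_records & (kToyCronEvery - 1)) == kToyCronEvery - 1;
}
```

For an unsigned operand and a compile-time power-of-two constant, optimizing compilers typically emit the mask for `%` anyway, so the explicit form mostly documents the intent and guards against a later change to a non-power-of-two value.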
// max_subcompactions values
class DBCompactionAbortSubcompactionTest
    : public DBCompactionAbortTest,
      public ::testing::WithParamInterface<int> {};
I would add a comment specifying what exactly the param is for
db/db_compaction_abort_test.cc
Outdated
ConfigureOptionsForStyle(options, style);
Reopen(options);

// Use larger value size for Universal compaction to ensure compaction work
Could you elaborate a bit more? Why wouldn't it work? If not having a specific amount of work breaks timing of the test, it may not be ideal
I forgot to clean this up. We no longer need this special configuration after tuning the parameters. Removed the specialization.
@xingbowang merged this pull request in 656b734.
Summary:
This adds a new public API to allow applications to abort all running compactions and prevent new ones from starting. Unlike DisableManualCompaction() which only pauses manual compactions and waits for them to finish naturally, AbortAllCompactions() actively signals running compactions (both automatic and manual) to terminate early and waits for them to complete before returning.
The abort signal is checked periodically during compaction (every 100 keys), so ongoing compactions abort quickly. Any output files from aborted compactions are automatically cleaned up to prevent partial results from being installed.
This is useful for scenarios where applications need to quickly stop all compaction activity, such as during graceful shutdown or when performing maintenance operations.
Pull Request resolved: facebook#14227
Test Plan:
- Unit tests in db_compaction_abort_test.cc cover various abort scenarios including: abort before/during compaction, abort with multiple subcompactions, nested abort/resume calls, abort with CompactFiles API, abort across multiple column families, and timing guarantees
- Updated compaction_job_test.cc to include the new parameter
Reviewed By: anand1976
Differential Revision: D91480994
Pulled By: xingbowang
fbshipit-source-id: 36837971d8a540cd34d3ec28a78bc94b582625b0
Summary:
This adds a new public API to allow applications to abort all running compactions and prevent new ones from starting. Unlike DisableManualCompaction() which only pauses manual compactions and waits for them to finish naturally, AbortAllCompactions() actively signals running compactions (both automatic and manual) to terminate early and waits for them to complete before returning.
The abort signal is checked periodically during compaction (every 100 keys), so ongoing compactions abort quickly. Any output files from aborted compactions are automatically cleaned up to prevent partial results from being installed.
This is useful for scenarios where applications need to quickly stop all compaction activity, such as during graceful shutdown or when performing maintenance operations.
Test Plan: